Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus
Abstract
Machine translation has been a major motivation of development in natural language processing. Despite the burgeoning achievements in creating more efficient machine translation systems, thanks to deep learning methods, parallel corpora have remained indispensable for progress in the field. In an attempt to create parallel corpora for the Kurdish language, in this article, we describe our approach in retrieving potentially alignable news articles from multi-language websites and manually align them across dialects and languages based on lexical similarity and transliteration of scripts. We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji. We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani. The corpus is publicly available under the CC BY-NC-SA 4.0 license.
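The abstract mentions pairing texts by lexical similarity after transliterating scripts. The following is only a minimal sketch of that general idea, not the authors' pipeline: the letter table `TOY_MAP`, the helpers `to_latin`, `jaccard`, and `candidate_pairs`, and the threshold value are all hypothetical names and choices made for illustration. It maps a handful of Arabic-script letters to Latin ones, scores every source/target pair by token-set overlap, and returns the pairs above a threshold so that a human could review them.

```python
from itertools import product

# Toy, deliberately incomplete Arabic-to-Latin letter table.
# Assumption for illustration only; not the scheme used in the paper.
TOY_MAP = {
    "ب": "b", "ا": "a", "ر": "r", "ن": "n", "د": "d",
    "م": "m", "ل": "l", "س": "s", "ت": "t", "ک": "k",
}


def to_latin(text):
    """Replace every character found in TOY_MAP; leave the rest unchanged."""
    return "".join(TOY_MAP.get(ch, ch) for ch in text).lower()


def jaccard(a, b):
    """Token-set overlap of two texts after toy transliteration."""
    tokens_a = set(to_latin(a).split())
    tokens_b = set(to_latin(b).split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def candidate_pairs(sources, targets, threshold=0.3):
    """Return (score, source_index, target_index) tuples, best score first."""
    scored = [
        (jaccard(s, t), i, j)
        for (i, s), (j, t) in product(enumerate(sources), enumerate(targets))
    ]
    return sorted((row for row in scored if row[0] >= threshold), reverse=True)


if __name__ == "__main__":
    sources = ["باران نان", "دار"]   # Arabic-script input
    targets = ["dar", "baran nan av"]  # Latin-script input
    for score, i, j in candidate_pairs(sources, targets):
        print(f"{score:.2f}  source[{i}] <-> target[{j}]")
```

Scoring all pairs is quadratic in the number of texts, so a real collection would need some pre-filtering (for example by publication date); the output here is meant as a shortlist for manual alignment rather than a final alignment.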
Similar resources
Building a Multilingual Parallel Subtitle Corpus
In this paper on-going work of creating an extensive multilingual parallel corpus of movie subtitles is presented. The corpus currently contains roughly 23,000 pairs of aligned subtitles covering about 2,700 movies in 29 languages. Subtitles mainly consist of transcribed speech, sometimes in a very condensed way. Insertions, deletions and paraphrases are very frequent which makes them a challen...
Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites
In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English–Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manual...
Building The Sense-Tagged Multilingual Parallel Corpus
Sense-annotated parallel corpora play a crucial role in natural language processing. This paper introduces our progress in creating such a corpus for Asian languages using English as a pivot, which is the first such corpus for these languages (Chinese, Japanese and Indonesian). Two sets of tools have been developed for sequential and targeted tagging, which are also easy to be set up for any ne...
Building a multilingual parallel corpus for human users
We present the architecture and the current state of InterCorp, a multilingual parallel corpus centered around Czech, intended primarily for human users and consisting of written texts with a focus on fiction. Following an outline of its recent development and a comparison with some other multilingual parallel corpora we give an overview of the data collection procedure that covers text selecti...
Building a Parallel Multilingual Corpus (Arabic-Spanish-English)
This paper presents the results (1st phase) of the on-going research in the Computational Linguistics Laboratory at Autónoma University of Madrid (LLI-UAM) aiming at the development of a multi-lingual parallel corpus (Arabic-Spanish-English) aligned on the sentence level and tagged on the POS level. A multilingual parallel corpus which brings together Arabic, Spanish and English is a new resour...
Journal
Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing
Year: 2022
ISSN: 2375-4699, 2375-4702
DOI: https://doi.org/10.1145/3511806